Thai language is considered as a non-segmented language where words are a string of symbols without explicit word boundaries, and also the structure of written Thai language is highly ambiguous. This problem causes an indexing technique has become a main issue in Thai text retrieval. To construct an inverted index for Thai texts, an index terms extraction technique is usually required to segment texts into index term schemes. Although index terms can be specified manually by experts, this process is very time consuming and labor-intensive. Word segmentation is one of the many techniques that are used to automatically extract index terms from Thai texts. However, most of the word segmentation techniques require linguistic knowledge and the preparation of these approaches is time consuming. An n-gram based approach is another automatic index terms extraction method that is often used as indexing technique for Asian languages including Thai. This approach is language independent which does not require any linguistic knowledge or dictionary. Although the n-gram approach out performs many indexing techniques for Asian languages in term of retrieval effectiveness, the disadvantage of n-gram approach is it suffers from large storage space and long retrieval time. In this paper we present the frequent max substring mining to extract index terms from Thai texts. Our method is language-independent and it does not rely on any dictionary or language grammatical knowledge. Frequent max substring mining is based on text mining that describes a process of discovering useful information or knowledge from unstructured texts. This approach uses the analysis of frequent max substring sets to extract all long and frequently-occurred substrings. We aim to employ the frequent max substring mining algorithm to address the drawback of n-gram based approach by keeping only frequent max substrings to reduce disk space requirement for storing index terms and to reduce the retrieval time in order to deal with the rapid growth of Thai texts.
展开▼